{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2c769f5e",
   "metadata": {},
   "source": [
    "# Use the Pandas String-Only get_dummies Method to Instantly Restructure your Data\n",
    "\n",
    "In this post, you'll learn how to use the fantastic `str.get_dummies` Pandas Series method to instantly restructure trapped data within a string. We begin by reading in a small sample dataset containing people's favorite fruits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f739ad95",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>fruits</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Ana</td>\n",
       "      <td>mango|orange|pear|nectarine|banana</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bill</td>\n",
       "      <td>orange|peach</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Calvin</td>\n",
       "      <td>pear|mango</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Dean</td>\n",
       "      <td>mango|apple</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Elias</td>\n",
       "      <td>nectarine|pear|mango</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Felicia</td>\n",
       "      <td>mango|pear</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>George</td>\n",
       "      <td>mango|pear|orange|nectarine|banana</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Henry</td>\n",
       "      <td>orange|banana</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      name                              fruits\n",
       "0      Ana  mango|orange|pear|nectarine|banana\n",
       "1     Bill                        orange|peach\n",
       "2   Calvin                          pear|mango\n",
       "3     Dean                         mango|apple\n",
       "4    Elias                nectarine|pear|mango\n",
       "5  Felicia                          mango|pear\n",
       "6   George  mango|pear|orange|nectarine|banana\n",
       "7    Henry                       orange|banana"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv('data/fruits.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1abb3055",
   "metadata": {},
   "source": [
    "Notice that the fruit column has multiple fruit names in each cell separated by a pipe character. While this format compactly represents the data, it isn't the most suitable for answering basic questions such as:\n",
    "\n",
    "* What is the number of fruit per person?\n",
    "* How many people enjoy banana?\n",
    "* Which people enjoy both oranges and bananas?\n",
    "* How any fruits are in-common with each person?\n",
    "\n",
    "## Attempt to answer questions in current form\n",
    "\n",
    "It's possible to answer these questions in the current format using Pandas, though, we will see how these ways are sub-optimal. Here we find the number of fruit per person by adding one to the count of the pipe characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "51129f28",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name\n",
       "Ana        5\n",
       "Bill       2\n",
       "Calvin     2\n",
       "Dean       2\n",
       "Elias      3\n",
       "Felicia    2\n",
       "George     5\n",
       "Henry      2\n",
       "Name: fruits, dtype: int64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = df.set_index('name')['fruits']\n",
    "s.str.count(r'\\|') + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0660a7d8",
   "metadata": {},
   "source": [
    "The number of people who enjoy banana."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "be978c1d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.str.contains('banana').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bb18d20",
   "metadata": {},
   "source": [
    "The people that enjoy both oranges and banana."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7b21ade2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name\n",
       "Ana         True\n",
       "Bill       False\n",
       "Calvin     False\n",
       "Dean       False\n",
       "Elias      False\n",
       "Felicia    False\n",
       "George      True\n",
       "Henry       True\n",
       "Name: fruits, dtype: bool"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.str.contains('(?=.*orange)(?=.*banana)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1979391",
   "metadata": {},
   "source": [
    "Finding the number of fruits in-common with both people is particularly difficult and is a clear case for reformatting the data.\n",
    "\n",
    "## Better formatting of the data\n",
    "\n",
    "All of these questions can be bettered answered if the data is in a different format. The `get_dummies` string-only method will split all values in a single cell into their own columns creating 0/1 indicator variables. Here, we pass in the pipe character to `get_dummies`, producing the following DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ac16cc0f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apple</th>\n",
       "      <th>banana</th>\n",
       "      <th>mango</th>\n",
       "      <th>nectarine</th>\n",
       "      <th>orange</th>\n",
       "      <th>peach</th>\n",
       "      <th>pear</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Ana</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bill</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Calvin</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dean</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Elias</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Felicia</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>George</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Henry</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         apple  banana  mango  nectarine  orange  peach  pear\n",
       "name                                                         \n",
       "Ana          0       1      1          1       1      0     1\n",
       "Bill         0       0      0          0       1      1     0\n",
       "Calvin       0       0      1          0       0      0     1\n",
       "Dean         1       0      1          0       0      0     0\n",
       "Elias        0       0      1          1       0      0     1\n",
       "Felicia      0       0      1          0       0      0     1\n",
       "George       0       1      1          1       1      0     1\n",
       "Henry        0       1      0          0       1      0     0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = s.str.get_dummies('|')\n",
    "df1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff5cfaa1",
   "metadata": {},
   "source": [
    "We can now answer the same questions as before. Here, we count the number of fruit for each person."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e067a735",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name\n",
       "Ana        5\n",
       "Bill       2\n",
       "Calvin     2\n",
       "Dean       2\n",
       "Elias      3\n",
       "Felicia    2\n",
       "George     5\n",
       "Henry      2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83591d8b",
   "metadata": {},
   "source": [
    "We sum up a single column to count the total number of people who enjoy bananas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c2011daa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1['banana'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93037916",
   "metadata": {},
   "source": [
    "Here, we use the `query` method to select each person who likes both oranges and bananas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "46f01ac7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apple</th>\n",
       "      <th>banana</th>\n",
       "      <th>mango</th>\n",
       "      <th>nectarine</th>\n",
       "      <th>orange</th>\n",
       "      <th>peach</th>\n",
       "      <th>pear</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Ana</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>George</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Henry</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apple  banana  mango  nectarine  orange  peach  pear\n",
       "name                                                        \n",
       "Ana         0       1      1          1       1      0     1\n",
       "George      0       1      1          1       1      0     1\n",
       "Henry       0       1      0          0       1      0     0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.query('orange + banana == 2')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c15f8c3",
   "metadata": {},
   "source": [
    "Finding the number of fruits in-common with each person is where the largest gain from restructuring comes from. Here, we use the matrix multiplication operator to multiply the DataFrame to itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "270b78d2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>name</th>\n",
       "      <th>Ana</th>\n",
       "      <th>Bill</th>\n",
       "      <th>Calvin</th>\n",
       "      <th>Dean</th>\n",
       "      <th>Elias</th>\n",
       "      <th>Felicia</th>\n",
       "      <th>George</th>\n",
       "      <th>Henry</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Ana</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bill</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Calvin</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dean</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Elias</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Felicia</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>George</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Henry</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "name     Ana  Bill  Calvin  Dean  Elias  Felicia  George  Henry\n",
       "name                                                           \n",
       "Ana        5     1       2     1      3        2       5      2\n",
       "Bill       1     2       0     0      0        0       1      1\n",
       "Calvin     2     0       2     1      2        2       2      0\n",
       "Dean       1     0       1     2      1        1       1      0\n",
       "Elias      3     0       2     1      3        2       3      0\n",
       "Felicia    2     0       2     1      2        2       2      0\n",
       "George     5     1       2     1      3        2       5      2\n",
       "Henry      2     1       0     0      0        0       2      2"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 @ df1.T"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c18f120b",
   "metadata": {},
   "source": [
    "## Master Data Analysis with Python\n",
    "\n",
    "If you enjoyed this tip and area looking to become an expert with Pandas, then check out my extremely comprehensive book, [Master Data Analysis with Python](https://dunderdata.com/master-data-analysis-with-python). It is the most comprehensive Pandas book available, comes with 500+ exercises, video tutorials, and certification exams."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
